The original Transformer architecture¶

taken from: https://arxiv.org/pdf/1706.03762.pdf

Important things to note are:

  • self-attention
  • positional encoding
  • skip-connections
  • encoder-decoder architecture

taken from https://medium.com/machine-intelligence-and-deep-learning-lab/transformer-the-self-attention-mechanism-d7d853c2c621

Why Self-Attention?

taken from: https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html

Self-attention allows the model to weight the importance of every word relative to every other word:

  • each query is compared with every key via a similarity measure (a scaled dot product, akin to a cosine similarity), and these similarities are normalized into the attention weights
  • the output z is the attention-weighted sum of the value vectors

How Self-Attention works¶

taken from http://lucasb.eyer.be/transformer

$$\text{attention(Q, K, V)} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
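The formula above can be sketched directly in NumPy (shapes and random values here are arbitrary illustrations, not from any real model):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # query-key similarities, scaled by sqrt(d_k)
    weights = softmax(scores)         # each row sums to 1
    return weights @ V, weights      # output z = weighted sum of the value vectors

rng = np.random.default_rng(42)
Q = rng.normal(size=(4, 8))  # 4 tokens, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
z, w = attention(Q, K, V)
```

Note that each row of `w` sums to 1: every token distributes its attention over all tokens.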

Multi-Head Self-Attention
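Multi-head attention runs several attention operations in parallel on split feature dimensions and concatenates the results. A minimal NumPy sketch (dimensions and the combined projection matrices are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # project, then split the feature dimension into heads: (heads, seq, d_head)
    Q = (X @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    # scaled dot-product attention independently per head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    Z = softmax(scores) @ V                              # (heads, seq, d_head)
    # concatenate the heads and apply the output projection
    Z = Z.transpose(1, 0, 2).reshape(seq_len, d_model)
    return Z @ Wo

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
```

Each head can learn to attend to different relationships (e.g. syntax vs. coreference), which is why multiple small heads beat one large one in practice.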

Byte-Pair-encoding Tokenization¶

Transformer models do not see single characters; they operate on sub-word tokens:

taken from https://platform.openai.com/tokenizer
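The core BPE training loop is simple: repeatedly merge the most frequent pair of adjacent symbols. A toy sketch on a tiny made-up corpus (real tokenizers like GPT's operate on bytes and huge corpora):

```python
from collections import Counter

def merge_pair(word, pair):
    """Replace every occurrence of the adjacent symbol pair with one merged symbol."""
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1]); i += 2
        else:
            out.append(word[i]); i += 1
    return out

def bpe_train(corpus, num_merges):
    """Toy byte-pair encoding: start from characters, merge the most frequent pair."""
    words = [list(w) + ["</w>"] for w in corpus]  # </w> marks the word end
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = [merge_pair(w, best) for w in words]
    return merges, words

merges, words = bpe_train(["low", "lower", "lowest"], num_merges=3)
# the common stem "low" emerges as a single token after a few merges
```

Frequent character sequences thus become single tokens, while rare words are split into several sub-word pieces.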

Architecture of some important Transformer models

taken from Lukas Beyer again

Training¶

taken from Chip Huyen

Datasets used for training¶

Colossal Cleaned Crawled Corpus (C4) This is 800GB of cleaned common internet crawl. https://github.com/google-research/text-to-text-transfer-transformer#c4

BookCorpus "The books have been crawled from https://www.smashwords.com, see their terms of service for more information."

Stack-Exchange preferences

Instruction Data-Sets

taken from Lilian Weng

Data-Sets "stolen" from ChatGPT

The project can be found here

Reinforcement-Learning with human feedback (RLHF)

Chain-Of-Thought (COT) Training

Data the models were trained on¶

Lineage of Chat-GPT

taken from: How does GPT Obtain its Ability?

Google "We Have No Moat, And Neither Does OpenAI"¶

Alpaca¶

Distilling ChatGPT: "Our data generation process results in 52K unique instructions and the corresponding outputs, which costed less than $500 using the OpenAI API."

Vicuna¶

"After fine-tuning Vicuna with 70K user-shared ChatGPT conversations, we discover that Vicuna becomes capable of generating more detailed and well-structured answers compared to Alpaca"
Costs: $300

Here is a link to the 'open source' models and their performance.

What did the open-source community solve?¶

LoRA (Low-Rank Adaptation)
This is the corresponding paper.

The pretrained weight matrix $\mathbf{W}$ is frozen during training. An additional weight update is trained via the two low-rank (rank $r$) matrices $\mathbf{A}$ and $\mathbf{B}$. Only these weights (orange) are updated. The input vector (dark blue) is multiplied with the frozen weights as well as with the low-rank adaptation of the weight matrix, and the results are simply added.
During training, only the gradients for the orange matrices have to be kept in GPU memory.

taken from Sebastian Raschka
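The forward pass described above fits in a few lines of NumPy (dimensions and initialization scale are illustrative assumptions; in the LoRA paper $\mathbf{B}$ starts at zero so training begins from the pretrained behavior):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 16, 2  # hypothetical sizes: hidden dimension d, LoRA rank r (r << d)
W = rng.normal(size=(d, d))         # pretrained weight matrix, frozen
A = rng.normal(size=(r, d)) * 0.01  # trainable, projects down to rank r
B = np.zeros((d, r))                # trainable, initialized to zero

def lora_forward(x):
    # frozen path and low-rank path; the results are simply added
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
y = lora_forward(x)
```

With $\mathbf{B}$ initialized to zero, `lora_forward(x)` initially equals `W @ x`; only the $2 \cdot d \cdot r$ LoRA parameters (instead of $d^2$) need gradients.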

Bits and Bytes
Tim Dettmers et al., 2022

QLoRA by Dettmers et al., 2023
In short, QLoRA reduces the memory usage of LLM finetuning without performance tradeoffs compared to standard 16-bit model finetuning. This method enables 33B model finetuning on a single 24GB GPU and 65B model finetuning on a single 48GB GPU. see here

illustration taken from here

llama.cpp¶

Plain C/C++ implementation without dependencies

  • runs quantized LLMs on edge devices
  • on Apple hardware
  • no training, just inference

Prompt Engineering¶

Prompt engineering is unlikely to remain a job in its own right.

From Mishra et al., 2022:

  • Use Low-level Patterns: Instead of using terms that require background knowledge to understand, use various patterns about the expected output.
  • Itemizing Instructions: Turn descriptive attributes into bulleted lists. If there are any negation statements, turn them into assertion statements.
  • Break it Down: Break down a task into multiple simpler tasks, wherever possible.
  • Enforce Constraint: Add explicit textual statements of output constraints.
  • Specialize the Instruction: Customize the instructions so that they directly speak to the intended output.

Prompting only serves to imitate the training data as closely as possible.

Fiction
Since most models are also trained on one or several book corpora, they can also be prompted to take on a fictional persona.

Zero-Shot CoT prompting
Let's think step by step
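In practice, zero-shot CoT is nothing more than appending the trigger phrase (from Kojima et al., 2022) to the question; the question below is a made-up example:

```python
question = "If a train travels 60 km in 40 minutes, what is its speed in km/h?"
# Zero-shot chain-of-thought: append the trigger phrase before sampling
prompt = question + "\nLet's think step by step."
print(prompt)
```

The phrase nudges the model to emit intermediate reasoning steps before the final answer, which markedly improves arithmetic and logic tasks.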

Remember Byte-Pair-Encoding:

For more funny examples with "O" see this twitter feed.

Why Large Language Models cannot calculate with large numbers:
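A toy greedy longest-match tokenizer with a hypothetical vocabulary illustrates the problem: the token boundaries inside a number do not line up with place value, so the same digits are split differently depending on context.

```python
# Hypothetical vocabulary; real BPE vocabularies contain thousands of
# such arbitrary digit chunks learned from corpus frequencies.
VOCAB = {"1", "2", "3", "4", "12", "34", "123"}

def tokenize(s):
    """Greedy longest-match tokenization, as a stand-in for BPE inference."""
    tokens, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):  # try the longest substring first
            if s[i:j] in VOCAB:
                tokens.append(s[i:j])
                i = j
                break
    return tokens

tokenize("12")    # ['12']   - one token
tokenize("34")    # ['34']   - one token
tokenize("1234")  # ['123', '4'] - NOT ['12', '34']
```

Since "1234" is not seen as the digits 1-2-3-4 (nor as 12 and 34), the model has no stable positional representation of digits to do arithmetic on.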

The ultimate solution:¶

Image('../images/rip_prompt_engineer.png')